Extract Structured Data From Text: Expert Mode (Using Kor)
For complicated data extraction you need a robust library. The Kor Library (created by Eugene Yurtsev) is an awesome tool just for this.
We are going to explore using Kor with a practical use case.
Why is this important? LLMs are great at text output, but they need extra help outputting information in the structure we want. A common request from developers is to get JSON data back from LLMs.
Spoiler: Jump down to the bottom to see a bona fide business idea that you can start and manage today.
# Unzip data folder
import zipfile
with zipfile.ZipFile('../../data.zip', 'r') as zip_ref:
    zip_ref.extractall('..')
# Kor!
from kor.extraction import create_extraction_chain
from kor.nodes import Object, Text, Number
# LangChain Models
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI
# Standard Helpers
import pandas as pd
import requests
import time
import json
from datetime import datetime
# Text Helpers
from bs4 import BeautifulSoup
from markdownify import markdownify as md
# For token counting
from langchain.callbacks import get_openai_callback
def printOutput(output):
    print(json.dumps(output, sort_keys=True, indent=3))
# It's better to do this with an environment variable, but it's in plain text here for clarity
openai_api_key = 'your_api_key'
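As the comment above notes, an environment variable is the safer place for the key. A minimal sketch, assuming the variable is named `OPENAI_API_KEY`; the helper name `load_api_key` is hypothetical and not part of Kor or LangChain:

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(env_var)
    if not key:
        # Fail early with a clear message instead of sending an empty key
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

With this in place, `openai_api_key = load_api_key()` could replace the hard-coded string.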
Let's start off by creating our LLM. We're using GPT-4 to take advantage of its increased ability to follow instructions.
llm = ChatOpenAI(
# model_name="gpt-3.5-turbo", # Cheaper but less reliable
model_name="gpt-4",
temperature=0,
max_tokens=2000,
openai_api_key=openai_api_key
)
Kor Hello World Example
Create an object that holds information about the fields you'd like to extract
person_schema = Object(
# This is what will appear in your output. It's what the fields below will be nested under.
# It should be the parent of the fields below. Usually it's singular (not plural)
id="person",
# Natural language description about your object
description="Personal information about a person",
# Fields you'd like to capture from a piece of text about your object.
attributes=[
Text(
id="first_name",
description="The first name of a person.",
)
],
# Examples help go a long way with telling the LLM what you need
examples=[
("Alice and Bob are friends", [{"first_name": "Alice"}, {"first_name": "Bob"}])
]
)
Create a chain that will extract the information and then parse it. This uses LangChain under the hood
chain = create_extraction_chain(llm, person_schema)
text = """
My name is Bobby.
My sister's name is Rachel.
My brother's name Joe. My dog's name is Spot
"""
output = chain.predict_and_parse(text=(text))["data"]
printOutput(output)
# Notice how there isn't "spot" in the results list because it's the name of a dog, not a person.
{
"person": [
{
"first_name": "Bobby"
},
{
"first_name": "Rachel"
},
{
"first_name": "Joe"
}
]
}
Kor also returns an empty list when the LLM doesn't find what you're looking for
output = chain.predict_and_parse(text=("The dog went to the park"))["data"]
printOutput(output)
{
"person": []
}
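Because a miss comes back as an empty list, downstream code can treat "nothing found" and "several found" the same way. A small sketch of that idea; `extract_records` is a hypothetical helper, and the single-dict branch is there because some outputs in this walkthrough nest one dict instead of a list:

```python
def extract_records(output, object_id):
    """Return the extracted records for object_id as a list (empty if none)."""
    records = output.get(object_id) or []
    # Some schemas yield a single dict rather than a list of dicts
    if isinstance(records, dict):
        records = [records]
    return records


found = {"person": [{"first_name": "Bobby"}, {"first_name": "Rachel"}]}
empty = {"person": []}

print([p["first_name"] for p in extract_records(found, "person")])
print(extract_records(empty, "person"))
```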
Multiple Fields
You can pass multiple fields if you're looking for more information
plant_schema = Object(
id="plant",
description="Information about a plant",
# Notice I put multiple fields to pull out different attributes
attributes=[
Text(
id="plant_type",
description="The common name of the plant."
),
Text(
id="color",
description="The color of the plant"
),
Number(
id="rating",
description="The rating of the plant."
)
],
examples=[
(
"Roses are red, lilies are white and a 8 out of 10.",
[
{"plant_type": "Roses", "color": "red"},
{"plant_type": "Lily", "color": "white", "rating" : 8},
],
)
]
)
text="Palm trees are brown with a 6 rating. Sequoia trees are green"
chain = create_extraction_chain(llm, plant_schema)
output = chain.predict_and_parse(text=text)['data']
printOutput(output)
{
"plant": [
{
"color": "brown",
"plant_type": "Palm tree",
"rating": "6.0"
},
{
"color": "green",
"plant_type": "Sequoia tree",
"rating": ""
}
]
}
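Notice that the `Number` field comes back as a string (`"6.0"`), and as an empty string when no rating was found. A hedged sketch of one way to normalize this after extraction; `to_number` is a hypothetical helper, not part of Kor:

```python
def to_number(value):
    """Convert a string such as "6.0" to a float; return None if it is blank or not numeric."""
    if value is None:
        return None
    try:
        return float(str(value).strip())
    except ValueError:
        return None


plants = [
    {"color": "brown", "plant_type": "Palm tree", "rating": "6.0"},
    {"color": "green", "plant_type": "Sequoia tree", "rating": ""},
]

# Build new dicts so the raw extraction output is left untouched
cleaned = [{**plant, "rating": to_number(plant.get("rating"))} for plant in plants]
print(cleaned)
```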
Working With Lists
You can also extract lists.
Note: Check out how I have a nested object. The "parts" object is in the "cars_schema".
parts = Object(
id="parts",
description="A single part of a car",
attributes=[
Text(id="part", description="The name of the part")
],
examples=[
(
"the jeep has wheels and windows",
[
{"part": "wheel"},
{"part": "window"}
],
)
]
)
cars_schema = Object(
id="car",
description="Information about a car",
examples=[
(
"the bmw is red and has an engine and steering wheel",
[
{"type": "BMW", "color": "red", "parts" : ["engine", "steering wheel"]}
],
)
],
attributes=[
Text(
id="type",
description="The make or brand of the car"
),
Text(
id="color",
description="The color of the car"
),
parts
]
)
# To do nested objects you need to specify encoder_or_encoder_class="json"
text = "The blue jeep has rear view mirror, roof, windshield"
# Changed the encoder to json
chain = create_extraction_chain(llm, cars_schema, encoder_or_encoder_class="json")
output = chain.predict_and_parse(text=text)['data']
printOutput(output)
{
"car": {
"color": "blue",
"parts": [
{
"part": "rear view mirror"
},
{
"part": "roof"
},
{
"part": "windshield"
}
],
"type": "jeep"
}
}
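The nested result above wraps each part in its own dict. If a flat list of names is easier to work with, a sketch like the following could unwrap it (`part_names` is a hypothetical helper):

```python
def part_names(car):
    """Return the part names from a car record as a flat list of strings."""
    return [item["part"] for item in car.get("parts", []) if "part" in item]


output = {
    "car": {
        "color": "blue",
        "parts": [
            {"part": "rear view mirror"},
            {"part": "roof"},
            {"part": "windshield"},
        ],
        "type": "jeep",
    }
}

print(part_names(output["car"]))
```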
View the prompt that was sent over
prompt = chain.prompt.format_prompt(text=text).to_string()
print(prompt)
Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.
```TypeScript
car: { // Information about a car
type: string // The make or brand of the car
color: string // The color of the car
parts: { // A single part of a car
part: string // The name of the part
}
}
```
Please output the extracted information in JSON format. Do not output anything except for the extracted information. Do not add any clarifying information. Do not add any fields that are not in the schema. If the text contains attributes that do not appear in the schema, please ignore them. All output must be in JSON format and follow the schema specified above. Wrap the JSON in <json> tags.
Input: the bmw is red and has an engine and steering wheel
Output: <json>{"car": [{"type": "BMW", "color": "red", "parts": ["engine", "steering wheel"]}]}</json>
Input: the jeep has wheels and windows
Output: <json>{"car": {"parts": [{"part": "wheel"}, {"part": "window"}]}}</json>
Input: The blue jeep has rear view mirror, roof, windshield
Output:
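The prompt asks the model to wrap its answer in `<json>` tags. To illustrate what decoding such a reply involves, here is a minimal sketch; it is not Kor's own parser, and `parse_tagged_json` is a hypothetical name:

```python
import json
import re

# Non-greedy match so only the first <json>...</json> pair is captured
JSON_TAG = re.compile(r"<json>(.*?)</json>", re.DOTALL)


def parse_tagged_json(response_text):
    """Return the JSON found between <json> tags, or {} if missing or invalid."""
    match = JSON_TAG.search(response_text)
    if match is None:
        return {}
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return {}


reply = '<json>{"car": {"type": "jeep", "color": "blue"}}</json>'
print(parse_tagged_json(reply))
```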
Kor is a really great way to extract actions from a user as well
schema = Object(
id="forecaster",
description=(
"User is controlling an app that makes financial forecasts. "
"They will give a command to update a forecast in the future"
),
attributes=[
Text(
id="year",
description="Year the user wants to update",
examples=[("please increase 2014's customers by 15%", "2014")],
many=True,
),
Text(
id="metric",
description="The unit or metric a user would like to influence",
examples=[("please increase 2014's customers by 15%", "customers")],
many=True,
),
Text(
id="amount",
description="The quantity of a forecast adjustment",
examples=[("please increase 2014's customers by 15%", ".15")],
many=True,
)
],
many=False,
)
chain = create_extraction_chain(llm, schema, encoder_or_encoder_class='json')
output = chain.predict_and_parse(text="please add 15 more units sold to 2023")['data']
printOutput(output)
{
"forecaster": {
"amount": [
"15"
],
"metric": [
"units sold"
],
"year": [
"2023"
]
}
}
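Once a command is extracted, the app still has to act on it. The sketch below shows one possible way to apply the output above to an in-memory forecast. It assumes amounts are absolute additions (as in "add 15 more units sold"); the schema does not distinguish percentages from absolute values, so a real app would need to capture that too. `apply_command` and the forecast layout are hypothetical:

```python
def apply_command(forecast, command):
    """Return a copy of forecast with each extracted amount added to its (year, metric)."""
    updated = dict(forecast)
    for year, metric, amount in zip(command["year"], command["metric"], command["amount"]):
        key = (int(year), metric)
        updated[key] = updated.get(key, 0) + float(amount)
    return updated


forecast = {(2023, "units sold"): 100}
command = {"amount": ["15"], "metric": ["units sold"], "year": ["2023"]}

print(apply_command(forecast, command))
```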
Opening Attributes - Real World Example
Opening Attributes (my sample project for this application)
If anyone wants to strategize on this project DM me on twitter
llm = ChatOpenAI(
# model_name="gpt-3.5-turbo",
model_name="gpt-4",
temperature=0,
max_tokens=2000,
openai_api_key=openai_api_key
)
We are going to be pulling jobs from Greenhouse. No API key is needed.
def pull_from_greenhouse(board_token):
    # If doing this in production, make sure you do retries and backoffs
    # Get your URL ready to accept a parameter
    url = f'https://boards-api.greenhouse.io/v1/boards/{board_token}/jobs?content=true'
    try:
        response = requests.get(url, timeout=30)
    except requests.RequestException as error:
        # In case the request itself fails (connection error, timeout, etc.)
        print(f"Whoops, error: {error}")
        return []
    status_code = response.status_code
    jobs = response.json()['jobs']
    print(f"{board_token}: {status_code}, Found {len(jobs)} jobs")
    return jobs
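The comment in the function mentions retries and backoffs for production use. A minimal sketch of exponential backoff around any callable; `with_retries` and its parameters are hypothetical, and the `sleep` argument exists only so the wait can be swapped out in tests:

```python
import time


def with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            # Re-raise once the final attempt has failed
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

For example, the request could be wrapped as `with_retries(lambda: requests.get(url, timeout=30))`, so that transient network errors are retried after 1 and then 2 seconds.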
Let's try it out for Okta
jobs = pull_from_greenhouse("okta")
okta: 200, Found 142 jobs
Let's look at a sample job with its raw dictionary
# Keep in mind that my job_ids will likely change when you run this depending on the postings of the company
job_index = 0
print ("Preview:\n")
print (json.dumps(jobs[job_index])[:400])
Preview:
{"absolute_url": "https://www.okta.com/company/careers/opportunity/4977199?gh_jid=4977199", "data_compliance": [{"type": "gdpr", "requires_consent": false, "requires_processing_consent": false, "requires_retention_consent": false, "retention_period": null}], "internal_job_id": 2518868, "location": {"name": "Melbourne "}, "metadata": null, "id": 4977199, "updated_at": "2023-04-05T22:41:12-04:00", "
Let's clean this up a bit
# I parsed through an output to create the function below
def describeJob(job_description):
    print(f"Job ID: {job_description['id']}")
    print(f"Link: {job_description['absolute_url']}")
    print(f"Updated At: {datetime.fromisoformat(job_description['updated_at']).strftime('%B %-d, %Y')}")
    print(f"Title: {job_description['title']}\n")
    print(f"Content:\n{job_description['content'][:550]}")
We'll look at another job. This job_id may or may not work for you depending on if the position is still active.
# Note: I'm using a hard coded job id below. You'll need to switch this if this job ever changes
# and it most definitely will!
job_id = 4982726
job_description = [item for item in jobs if item['id'] == job_id][0]
describeJob(job_description)
Job ID: 4982726
Link: https://www.okta.com/company/careers/opportunity/4982726?gh_jid=4982726
Updated At: April 11, 2023
Title: Staff Software Engineer
Content:
<div class="content-intro"><p><span style="color: #000000;"><strong>Get to know Okta</strong></span></p>
<p><span style="color: #000000;"><br></span>Okta is The World's Identity Company. We free everyone to safely use any technology - anywhere, on any device or app. Our Workforce and Customer Identity Clouds enable secure yet flexible access, authentication, and automation that transforms how people move through the digital world, putting Identity at t
I want to convert the HTML to text, so we'll use BeautifulSoup to do this. There are multiple methods you could choose from. Pick what's best for you.
soup = BeautifulSoup(job_description['content'], 'html.parser')
text = soup.get_text()
# Convert your html to markdown. This reduces tokens and noise
text = md(text)
print (text[:600])
**Get to know Okta**
Okta is The World's Identity Company. We free everyone to safely use any technology - anywhere, on any device or app. Our Workforce and Customer Identity Clouds enable secure yet flexible access, authentication, and automation that transforms how people move through the digital world, putting Identity at the heart of business security and growth.
At Okta, we celebrate a variety of perspectives and experiences. We are not looking for someone who checks every single box, we're looking for lifelong learners and people who can make us better with their unique experien
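BeautifulSoup is only one of the possible methods. As a dependency-free alternative, here is a sketch that strips tags with the standard library's `html.parser`. It unescapes entities first in case the HTML arrives entity-escaped, and it drops all formatting rather than producing markdown. `TextExtractor` and `html_to_text` are hypothetical names:

```python
from html import unescape
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the non-empty text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def html_to_text(raw_html):
    """Return the visible text of raw_html, one text node per line."""
    parser = TextExtractor()
    parser.feed(unescape(raw_html))
    parser.close()
    return "\n".join(parser.chunks)


sample = "<div><p><strong>Get to know Okta</strong></p><p>Staff Software Engineer</p></div>"
print(html_to_text(sample))
```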
Let's create a Kor object that will look for tools. This is the meat and potatoes of the application
tools = Object(
id="tools",
description="""
A tool, application, or other company that is listed in a job description.
Analytics, eCommerce and GTM are not tools
""",
attributes=[
Text(
id="tool",
description="The name of a tool or company"
)
],
examples=[
(
"Experience in working with Netsuite, or Looker a plus.",
[
{"tool": "Netsuite"},
{"tool": "Looker"},
],
),
(
"Experience with Microsoft Excel",
[
{"tool": "Microsoft Excel"}
]
),
(
"You must know AWS to do well in the job",
[
{"tool": "AWS"}
]
),
(
"Troubleshooting customer issues and debugging from logs (Splunk, Syslogs, etc.) ",
[
{"tool": "Splunk"},
]
)
],
many=True,
)
chain = create_extraction_chain(llm, tools, input_formatter="triple_quotes")
output = chain.predict_and_parse(text=text)["data"]
printOutput(output)
{
"tools": [
{
"tool": "Okta"
},
{
"tool": "Java"
},
{
"tool": "Hibernate"
},
{
"tool": "Spring Boot"
},
{
"tool": "AWS"
},
{
"tool": "GCP"
},
{
"tool": "SQL"
},
{
"tool": "ElasticSearch"
},
{
"tool": "Docker"
},
{
"tool": "Kubernetes"
}
]
}
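Results like the one above become more useful when aggregated across many postings. A sketch of counting how many postings mention each tool, assuming each item in `outputs` has the same shape as the printed result; `count_tools` is a hypothetical helper:

```python
from collections import Counter


def count_tools(outputs):
    """Count the number of postings that mention each tool (case-insensitive)."""
    counts = Counter()
    for output in outputs:
        # A set ensures a tool repeated within one posting is counted once
        names = {
            item["tool"].strip().lower()
            for item in output.get("tools", [])
            if item.get("tool")
        }
        counts.update(names)
    return counts


outputs = [
    {"tools": [{"tool": "Java"}, {"tool": "AWS"}, {"tool": "aws"}]},
    {"tools": [{"tool": "AWS"}, {"tool": "Docker"}]},
    {"tools": []},
]

print(count_tools(outputs).most_common())
```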
Salary
Let's grab salary information while we are at it.
Not all jobs will list this information. When they do, it's rarely consistent across jobs, which makes this a great use case for LLMs!
salary_range = Object(
id="salary_range",
description="""
The range of salary offered for a job mentioned in a job description
""",
attributes=[
Number(
id="low_end",
description="The low end of a salary range"
),
Number(
id="high_end",
description="The high end of a salary range"
)
],
examples=[
(
"This position will make between $140 thousand and $230,000.00",
[
{"low_end": 140000, "high_end": 230000},
]
)
]
)
jobs = pull_from_greenhouse("cruise")
cruise: 200, Found 219 jobs
# This job id may not work for you, pick another one from the list if it doesn't.
job_id = 4858414
job_description = [item for item in jobs if item['id'] == job_id][0]
describeJob(job_description)
soup = BeautifulSoup(job_description['content'], 'html.parser')
text = soup.get_text()
# Convert your html to markdown. This reduces tokens and noise
text = md(text)
print (text[:600])
Job ID: 4858414
Link: https://boards.greenhouse.io/cruise/jobs/4858414?gh_jid=4858414
Updated At: April 12, 2023
Title: Senior Data Center Technician
Content:
<div class="content-intro"><p><span style="font-weight: 400;">We're Cruise, a self-driving service designed for the cities we love.</span></p>
<p><span style="font-weight: 400;">We're building the world's most advanced self-driving vehicles to safely connect people to the places, things, and experiences they care about. We believe self-driving vehicles will help save lives, reshape cities, give back time in transit, and restore freedom of movement for many.</span>
We're Cruise, a self-driving service designed for the cities we love.
We're building the world's most advanced self-driving vehicles to safely connect people to the places, things, and experiences they care about. We believe self-driving vehicles will help save lives, reshape cities, give back time in transit, and restore freedom of movement for many.
In our cars, you're free to be yourself. It's the same here at Cruise. We're creating a culture that values the experiences and contributions of all of the unique individuals who collectively make up Cruise, so that every employee can do thei
chain = create_extraction_chain(llm, salary_range)
output = chain.predict_and_parse(text=text)["data"]
printOutput(output)
{
"salary_range": [
{
"high_end": "165000",
"low_end": "112300"
}
]
}
The salary range for this position is $112,300 - 165,000. Compensation will vary depending on location, job-related knowledge, skills, and experience. You may also be offered a bonus, restricted stock units, and benefits. These ranges are subject to change.
Awesome!
with get_openai_callback() as cb:
    result = chain.predict_and_parse(text=text)
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Successful Requests: {cb.successful_requests}")
    print(f"Total Cost (USD): ${cb.total_cost}")
Total Tokens: 1768
Prompt Tokens: 1757
Completion Tokens: 11
Successful Requests: 1
Total Cost (USD): $0.053369999999999994
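The total above can be reproduced by hand. The sketch below assumes rates of $0.03 per 1K prompt tokens and $0.06 per 1K completion tokens, which are consistent with the reported total for this run; pricing changes over time, so treat the constants as placeholders:

```python
# Assumed per-token rates (USD); check current pricing before relying on them
PROMPT_RATE = 0.03 / 1000
COMPLETION_RATE = 0.06 / 1000


def estimate_cost(prompt_tokens, completion_tokens):
    """Estimate the USD cost of one call from its token counts."""
    return prompt_tokens * PROMPT_RATE + completion_tokens * COMPLETION_RATE


cost_per_job = estimate_cost(1757, 11)
print(round(cost_per_job, 5))          # 0.05337
print(round(cost_per_job * 1000, 2))   # rough cost of 1,000 similar postings
```

At roughly five cents per posting, trimming low-signal text from the prompt matters quickly once you process thousands of jobs.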
Suggested To Do if you want to build this out:
Reduce amount of HTML and low-signal text that gets put into the prompt
Gather list of 1000s of companies
Run through most jobs (You'll likely start to see duplicate information after the first 10-15 jobs per department)
Store results
Snapshot daily as you look for new jobs
Follow Greg on Twitter for more tools or if you want to chat about this project
Read the user feedback below for what else to build out with this project (I reached out to everyone who signed up on twitter)
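The "store results" and "snapshot daily" steps above could be sketched with SQLite from the standard library. The table name, columns, and `save_snapshot` helper are all hypothetical; the primary key means that re-running a snapshot for the same day overwrites the row instead of duplicating it:

```python
import json
import sqlite3
from datetime import date


def save_snapshot(conn, company, job_id, extracted, snapshot_date=None):
    """Store one job's extracted data, keyed by day, company, and job id."""
    snapshot_date = snapshot_date or date.today().isoformat()
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_snapshots ("
        "snapshot_date TEXT, company TEXT, job_id INTEGER, extracted TEXT, "
        "PRIMARY KEY (snapshot_date, company, job_id))"
    )
    conn.execute(
        "INSERT OR REPLACE INTO job_snapshots VALUES (?, ?, ?, ?)",
        (snapshot_date, company, job_id, json.dumps(extracted)),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
save_snapshot(conn, "okta", 4982726, {"tools": [{"tool": "Java"}]}, "2023-04-11")
save_snapshot(conn, "okta", 4982726, {"tools": [{"tool": "Java"}]}, "2023-04-12")
print(conn.execute("SELECT COUNT(*) FROM job_snapshots").fetchone()[0])
```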
Business idea: Job Data As A Service
Start a data service that collects information about companies' job postings. This can be sold to investors looking for an edge.
After posting this tweet, 80 people signed up for the trial. I emailed all of them, and most were job seekers looking for companies that used the tech they specialized in.
The more interesting use cases were sales teams and investors.
Interesting User Feedback (Persona: Investor):
Hey Gregory, thanks for reaching out.
I always thought that job posts were a gold mine of information, and often suggest identifying targets based on these (go look at relevant job posts for companies that might want to work with you). Secondly, I also automatically ping BuiltWith from our CRM and send that to OpenAI and have a summarized tech stack created - so I see the benefit of having this as an investor.
For me personally, I like to get as much data as possible about a company. Would love to see job post cadence, type of jobs they post and when, notable keywords/phrases used, tech stack (which you have), and any other information we can glean from the job posts (sometimes they have the title of who you'll report to, etc.).
For sales people, I think finer searches, maybe even in natural language if possible - such as "search for companies who posted a data science related job for the first time" - would be powerful.
If you do this, let me know! I'd love to hear how it goes.